Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-5625][VL] Support window range frame #5626

Merged
merged 9 commits into from
Jun 12, 2024

Conversation

WangGuangxin
Copy link
Contributor

@WangGuangxin WangGuangxin commented May 6, 2024

What changes were proposed in this pull request?

Support window range frame by pre-project frame offset for non-SpecialFrameBoundary, based on the discussion facebookincubator/velox#7557

(Fixes: #5625)

How was this patch tested?

UT

Copy link

github-actions bot commented May 6, 2024

#5625

Copy link

github-actions bot commented May 6, 2024

Run Gluten Clickhouse CI

5 similar comments
Copy link

github-actions bot commented May 6, 2024

Run Gluten Clickhouse CI

Copy link

github-actions bot commented May 6, 2024

Run Gluten Clickhouse CI

Copy link

github-actions bot commented May 6, 2024

Run Gluten Clickhouse CI

Copy link

github-actions bot commented May 6, 2024

Run Gluten Clickhouse CI

Copy link

github-actions bot commented May 6, 2024

Run Gluten Clickhouse CI

Copy link

github-actions bot commented May 6, 2024

Run Gluten Clickhouse CI

@WangGuangxin
Copy link
Contributor Author

cc @PHILO-HE @rui-mo

@FelixYBW
Copy link
Contributor

FelixYBW commented May 7, 2024

@PHILO-HE is the PR offload SQL like below?

  sum(x) OVER (
    PARTITION BY id,sur
    ORDER BY date RANGE BETWEEN 6 preceding AND CURRENT ROW
  )

@PHILO-HE
Copy link
Contributor

PHILO-HE commented May 7, 2024

@PHILO-HE is the PR offload SQL like below?

  sum(x) OVER (
    PARTITION BY id,sur
    ORDER BY date RANGE BETWEEN 6 preceding AND CURRENT ROW
  )

@FelixYBW, yes, this pr can remove the fallback for such case.

@FelixYBW
Copy link
Contributor

FelixYBW commented May 7, 2024

Let me do a test once merged. Thank you!

@FelixYBW
Copy link
Contributor

FelixYBW commented May 7, 2024

@WangGuangxin can you help to resolve conflicts?

Copy link

github-actions bot commented May 8, 2024

Run Gluten Clickhouse CI

@WangGuangxin
Copy link
Contributor Author

@WangGuangxin can you help to resolve conflicts?

@FelixYBW done

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will spare some time to review in next week. Sorry for late response.

Copy link
Contributor

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just added several comments.

return std::make_tuple(
core::WindowNode::BoundType::kFollowing,
exprConverter_->toVeloxExpr(boundType.following().ref(), inputType));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the only difference of the two branches is the expression. Can we differentiate the expressions only and reuse the rest of code?

return std::make_tuple(
core::WindowNode::BoundType::kPreceding,
exprConverter_->toVeloxExpr(boundType.preceding().ref(), inputType));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.
I tried to reuse the logic here, but since the type of boundType.preceding() and boundType.following() are different, so I have to pass the offset and ref instead of the expression itself

String upperBound,
String lowerBound,
org.apache.spark.sql.catalyst.expressions.Expression upperBound,
org.apache.spark.sql.catalyst.expressions.Expression lowerBound,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we import org.apache.spark.sql.catalyst.expressions.Expression and use it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Expression is conflict with io.substrait.proto.Expression which alreay imported in this class, and Java doen't have the alias grammer like scala

if (offset < 0) {
Expression.WindowFunction.Bound.Preceding.Builder offsetPrecedingBuilder =
Expression.WindowFunction.Bound.Preceding.newBuilder();
offsetPrecedingBuilder.setOffset(0 - offset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need -offset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need. In fact, this branch is the existing logic


// the reference to pre-project range frame boundary.
Expression ref = 2;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Copy link

Run Gluten Clickhouse CI

2 similar comments
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

1 similar comment
Copy link

Run Gluten Clickhouse CI

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your great work! Just posted some comments. Please check whether it makes sense.

throw new UnsupportedOperationException(
"Unsupported Window Function Frame Type:" + boundType);
}
} else if (boundType.foldable()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if branch is just for ClickHouse backend? It would be nice to leave some comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clickhouse backend and velox backend with ROW frame. Comments added

}
} catch (NumberFormatException e) {
throw new UnsupportedOperationException(
"Unsupported Window Function Frame Type:" + boundType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add statements to catch this exception? I think Spark parser has already checked the validity of foldable bound (e.g., 100a is not valid). The catching looks unnecessary.

case swf: SpecifiedWindowFrame if needPreComputeRangeFrame(swf) =>
// This is guaranteed by Spark, but we still check it here
if (orderSpecs.size != 1) {
throw new GlutenNotSupportException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use GlutenException. Generally GlutenNotSupportException is caught in fallback handling for some limitation cases. But for this case, I think we can directly let a runtime exception happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return std::make_tuple(
core::WindowNode::BoundType::kFollowing,
std::make_shared<core::ConstantTypedExpr>(BIGINT(), variant(boundType.following().offset())));
specifiedBound(following.has_offset(), following.offset(), following.ref()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can has_offset be true for Velox backend after the java side handling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's reused both for ROW and RANGE frame, so offset is used for ROW frame here.

Copy link

Run Gluten Clickhouse CI

1 similar comment
Copy link

Run Gluten Clickhouse CI

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor comments. We can merge it after CI is fixed. Thanks!

offsetFollowingBuilder.setOffset(offset);
builder.setFollowing(offsetFollowingBuilder.build());
}
} catch (NumberFormatException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks we can also remove this try/catch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
} catch (NumberFormatException e) {
} else {
throw new UnsupportedOperationException(
"Unsupported Window Function Frame Type:" + boundType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not frame type, but bound type, please help fix it in your pr.

"Unsupported Window Function Frame Bound Type: " + boundType

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link

Run Gluten Clickhouse CI

@WangGuangxin
Copy link
Contributor Author

@PHILO-HE @rui-mo Can you please review it again?

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Jun 4, 2024

@PHILO-HE @rui-mo Can you please review it again?

@WangGuangxin, the pr looks good to me. Could you fix the CI failure? Looks relevant to this pr.

Copy link

github-actions bot commented Jun 4, 2024

Run Gluten Clickhouse CI

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Jun 4, 2024

@WangGuangxin, please also rebase the code. Thanks!

Copy link

github-actions bot commented Jun 4, 2024

Run Gluten Clickhouse CI

Comment on lines +163 to +167
val a = expressionMap
.getOrElseUpdate(
bound.canonicalized,
Alias(Add(orderSpec.child, bound), generatePreAliasName)())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really support reuse the reference for the same bound literal ? What happens if the two window expressions have the same frame but requires different ordering column.

  sum(x) OVER (
    PARTITION BY c1
    ORDER BY c2 RANGE BETWEEN 1 preceding AND CURRENT ROW
  ), 
  sum(x) OVER (
    PARTITION BY c1
    ORDER BY c3 RANGE BETWEEN 1 preceding AND CURRENT ROW
  )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really support reuse the reference for the same bound literal ? What happens if the two window expressions have the same frame but requires different ordering column.

  sum(x) OVER (
    PARTITION BY c1
    ORDER BY c2 RANGE BETWEEN 1 preceding AND CURRENT ROW
  ), 
  sum(x) OVER (
    PARTITION BY c1
    ORDER BY c3 RANGE BETWEEN 1 preceding AND CURRENT ROW
  )

let me check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The frame bouding is a reference, which will be pre-project before window operator. And the sort/partition are calculated in window operator. So it's ok to reuse it.
I add a UT for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new added UT succeed in spark3.2, but failed in spark3.4/3.5. I'll fix it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for late response, all UT passed now, the failed one seems unrelated. Please help review again @PHILO-HE @ulysses-you

The reason why it was failed is because sort column l_discount is a decimal, and when we do Add(sort_col, offset), the precision of decimal type will change, so that the data type check failed in Spark (https://github.com/apache/spark/blob/82a84ede6a47232fe3af86672ceea97f703b3e8a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L83)

In fact, we only support ByteType | ShortType | IntegerType | LongType | DateType by design, which is checked in https://github.com/apache/incubator-gluten/pull/5626/files#diff-68925b6a5ee67a942a976971ae6727a45a0b166d200d8d983454169e7c3bd1f6R316, but even when WindowExec fallback by this check, the PullOutPreProjection rule will still be executed.

Copy link

github-actions bot commented Jun 5, 2024

Run Gluten Clickhouse CI

1 similar comment
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link
Contributor

@ulysses-you ulysses-you left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @WangGuangxin

@ulysses-you ulysses-you merged commit 7e217cf into apache:main Jun 12, 2024
39 of 40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[VL] Support window range frame
5 participants